Single cell lineage reconstruction using distance-based algorithms and the R package, DCLEAR

Background DCLEAR is an R package used for single cell lineage reconstruction. The advances of CRISPR-based gene editing technologies have enabled the prediction of cell lineage trees based on observed edited barcodes from each cell. However, the performance of existing reconstruction methods of cell lineage trees was not accessed until recently. In response to this problem, the Allen Institute hosted the Cell Lineage Reconstruction Dream Challenge in 2020 to crowdsource relevant knowledge from across the world. Our team won sub-challenges 2 and 3 in the challenge competition. Results The DCLEAR package contained the R codes, which was submitted in response to sub-challenges 2 and 3. Our method consists of two steps: (1) distance matrix estimation and (2) the tree reconstruction from the distance matrix. We proposed two novel methods for distance matrix estimation as outlined in the DCLEAR package. Using our method, we find that two of the more sophisticated distance methods display a substantially improved level of performance compared to the traditional Hamming distance method. DCLEAR is open source and freely available from R CRAN and from under the GNU General Public License, version 3. Conclusions DCLEAR is a powerful resource for single cell lineage reconstruction.

This field lacked experimentation or analyses regarding the effectiveness of these proposed algorithms for comprehensive evaluation. To improve the evaluation process, the Allen Institute established The Cell Lineage Reconstruction DREAM Challenge [6]. The Allen Institute proposed three different sub-challenges to benchmark reconstruction algorithms of cell lineage trees: (1) the reconstruction of in vitro cell lineages of 76 trees with fewer than 100 cells; (2) the reconstruction of an in silico cell lineage tree of 1000 cells; (3) the reconstruction of an in silico cell lineage tree of 10,000 cells. Our proposed DCLEAR method won sub-challenges 2 and 3 of this challenge competition.
We outline and define the problem setting addressed in cell lineage reconstruction in the next section. We then present two core methods for distance matrix construction and outline how DCLEAR software may be applied to a simulated dataset.

Problem setup
Assume we have n number of training data pairs. Each data pair consists of a set of cell sequences and a true cell lineage tree. One example of such a pair is illustrated in Fig. 1.
For the ith data pair, let m i be the number of cell sequences in the ith data pair and let t be the sequence length. Let the sequence information in data pair i be written as C i , an m i × t matrix. The matrix element C i jk describes the jth sequence and the kth letter of the ith training data pair. Furthermore, we represent L i , the cell lineage tree structure, in the form of a Newick format string.
For example, if Fig. 1 represents ith data pair, (a) shows the sequence information. It has m i = 4 cell sequences, each sequence length has a length (t) of 10, and the first letter of the 3rd sequence is C i 3,1 = E . The cell lineage tree is shown in Fig. 1b. In Newick format, L i = ((1 : 0.5, 2 : 0.5) : 2, (3 : 1.2, 4 : 1) : 2). We built our model ( m(C;θ) ) with n training pairs. The input matrix, C is an m i × t sequence information matrix. The output is the Newick format string representing the tree structure while θ represents the parameter set related to model m(C; θ) , and θ represents the estimated parameter with n training data pairs. To evaluate the model, we have l number of unused data. The loss is indicated as L(m(C i ;θ), L i ) . The loss increases when the predicted tree structure ( m(C;θ) ) differs from the true tree structure ( L i ). We use the averaged loss metric (AL) defined below to evaluate the model m(C;θ). Next, we need to define the quantity L(L 1 , L 2 ) that represents the dissimilarity between the two lineage trees, L 1 and L 2 . The two widely used measures of this dissimilarity are the Robinson-Foulds (RF) distance [7] and the Triplet distance [8]. For triplet distance calculations, we sample three items among all items in the tree. We determine 5 C 3 = 10 possible cases. We check whether the tree structure of the three items in tree 1 and tree 2 are the same. For example, the five cases of {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 4, 5} and {2, 4, 5} have the same tree structure in both tree 1 and tree 2. The triplet score is defined as the number of cases with the same tree structure divided by the number of possible cases. Accordingly, the triplet score for our example is 5/10 = 0.5.
We will use AL RF to denote the RF distance and AL TP to denote the triplet distance.

Overview of the modeling architecture
The two main issues that need to be answered in the lineage reconstruction problem are (1) how should the model m(C; θ) be built and (2) how should θ be estimated? Our modeling architecture for m(C; θ) is described in Fig. 3. We divide the model into two parts: (1) estimating the distance between cells and (2) constructing a tree using a distance matrix. Let D(C) be a function for estimating the distance matrix for an m × t input sequence matrix, C, and let t(D) be a function for predicting the lineage tree for an m × m distance matrix, D. Note that a knowledge of the triangular components in D is sufficient for defining the distance matrix. Subsequently, the challenge becomes how D(·) and t(·) should be defined.

Choice of distance
One notion for calculating the distance is to define the distance function for the two sequences. Let d(C i· , C j· ; θ) = d ij be the calculated distance between the ith cell

Fig. 2
Two different trees, a L 1 and b L 2 , are presented to explain how the RF and triplet distances are defined. There are two possible cuts for each tree to separate items in the tree with two different sets of more than two items. The red slanted lines represent the possible cuts for separation and the jth cell obtained from the given cell information matrix C. The quantities C i· , i = 1, · · · , m represent the ith cell vector taken from C. The quantity d ij is the (i, j)th element of the cell distance matrix D. The next challenge becomes how d(·, ·) should be defined.

Hamming distance method
One naive approach to modeling d(·, ·) is to compute the Hamming distance expressed in equation (2): where 1(condition) is an indicator function with a value of 1 if the given condition is true and 0 otherwise. Note that the Hamming distance d H (C i· , C j· ) simply counts unit differences between the two sequences C i· and C j· . The simple calculation of the Hamming distance does not meet the challenges of the present study. Consider the cell differentiation process illustrated in Fig. 4. Let the 2nd and the 3rd leaf cells (dotted) have C 2· = 0AB-0 and C 3· = 00CB0 . The corresponding Hamming distance is d H (C 2· , C 3· ) = 3 . However, it is not reasonable to assign equal  weights to each target position. This deficiency is addressed by the weighted Hamming approach described in the following sub-sub-section.
The code for using the Hamming distance method is available from the phangorn package [9] using the dist.hamming function.

Weighted Hamming distance (WHD) method
An example of the cell diffusion process is illustrated in Fig. 4. Consider the objective of reconstructing the lineage tree based on information about the leaf nodes. What would be an appropriate measure for calculating the distance between 0AB-0 and 00CB0 (see the dotted circles in Fig. 4)? The character difference between ' A' and '0' would be less than the character difference between 'B' and 'C' as the initial state '0' can be differentiated to ' A' or to any other outcome states, whereas states 'B' and 'C' cannot be differentiated to other outcome states. In addition, the missing state '-' maybe any other state. Seeking a mathematical formula to accommodate these nuances, we propose the weighted Hamming distance (WHD) method: where, C il is the lth character in the ith cell sequence, and w C il is a weight associated with the character C il . The code for calculating the WHD method is available as the dist_ weighted_hamming function in the DCLEAR package. For the estimation of weight parameters, we used Bayesian hyperparameter optimization using the BayesianOptimization function in the rBayesianOptimization package [10].

k-mer replacement distance (KRD) method
The algorithm used to compute the k-mer replacement distance (KRD) method first uses the prominence of mutations in the character arrays to estimate the summary statistics used for the generation of the tree to be reconstructed. These statistics include the mutation rate, the mutation probability for each character in the array, the number of targets, and the number of cells. These estimated parameters were combined with pre-defined parameters, such as the number of cell divisions, to simulate multiple lineage trees starting from the non-mutated root. To generate trees with sizes and mutation distributions similar to the target tree, we generated 1000 lineage trees, each with 16 cell divisions of 216 leaves, a mutation rate of 0.1, arrays of 200 characters, 200 cells, and 30 states (' A'-'Z' to 'a'-'c' , with an outcome probability following a Gamma distribution with a shape of 0.1 and a rate of 2). Different possibilities for the k-mer distance method were then estimated from the simulated lineage trees and used to compute the distances between the input sequences in the character arrays of internal nodes and tips. The KRD method is available from the DCLEAR package using the dist_replacement function.

Tree construction
We use existing methods such as Neighbor-Joining (NJ), UPGMA, and FastMe [11][12][13] for tree construction from the estimated distance matrix, D. The NJ method is implemented as the nj function in the Analysis of Phylogenetics and Evolution (ape) package, UPGMA is implemented as the upgma function in the phangorn package, and FastMe is implemented as fastme.bal, and fastme.ols in the ape package.

Results
This section reports the experimental results of applying the Hamming distance, WHD, and KRD methods using existing tree construction methods (NJ, UPGMA, and FastMe). For the use of NJ, UPGMA, and FastMe, the nj function in the ape package [14] was used for the NJ method, the upgma function in the phangorn package [9] was used for the UPGMA method, the fastme.ols function in the ape was used for the FastMe method, and the fastme.bal function in the ape was used for FastMe with tree rearrangement.

Simulation dataset
The simulation dataset was generated from our simulation code. Parameter settings for the simulation were as follows: We iterated all 90 combinations ( 3 × 5 × 3 × 2 ) of the parameter 5 times, result in 450 simulation datasets. For each iteration, We generated five training sets and one evaluation set. The averaged performance of the 450 evaluation sets was reported.

Sub-challenge 2 and 3 datasets
The sub-challenge 2 and 3 datasets can be downloaded from the competition website of the Allen Institute Cell Lineage Reconstruction Challenge at https:// www. synap se. org/# !Synap se: syn20 692755 (accessed on 17 September 2021). The sub-challenge 2 dataset (the dataset for C.elegans cells) contained a 1000 cell tree from the 200 mutated/ non-mutated targets in each cell induced by simulation, and the sub-challenge 3 dataset (the dataset for mouse cells) had a 10,000 cell tree from the 1000 mutated/non-mutated targets in each cell induced by simulation. The different simulation models were used for sub-challenges 2 and 3. Both datasets had 100 trees. We randomly divided the subchallenge dataset into 75 trees for training and 25 trees for evaluation.

Experimental results
We outlined our experimental results in Fig. 5. The y-axis represented the RF distance, and the x-axis accommodated the different models. The Hamming distance, the KRD, and the WHD methods were used for distance calculation. NJ, UPGMA, and FastMe were methods for tree construction. For all three datasets, the KRD and the WHD methods displayed improved performance compared to the Hamming distance method. The performance achieved with the KRD was similar to that achieved with WHD. Among the tree construction methods, FastMe with tree rearrangement displayed improved performance compared to the other tree construction methods.

Discussion
In these experiments, we only compared the Hamming distance, the WHD, and the KRD methods using the three datasets. Gong et al. [6] presented the comparison of the WHD and the KRD with the other methods that participated in the Cell Lineage Reconstruction Dream Challenge (2020). Our proposed WHD method was used for sub-challenge 3, and the KRD method was used for sub-challenge 2. DCLEAR won both sub-challenges 2 and 3 but did not participate in sub-challenge 1. The sub-challenge 1 dataset has only ten target positions with two outcome states. The WHD and KRD methods were more powerful techniques with more complex settings, such as the existence of a dropout interval, the existence of missing target positions, and a larger number of outcome states. With the small target positions and the small outcome states of sub-challenge 1, the gain achieved by using the WHD and the KRD methods was not high. We were not able to show a comparison for sub-challenge 1 as DCLEAR did not participate in that sub-challenge. Thus, the future work of DCLEAR will contain an extension of the WHD and the KRD methods with small target and small outcome settings. Furthermore, for the WHD method, the hyperparameter tuning was performed using BayesianOptimization because the loss was not differentiable with respect to weight parameters. We could utilize the surrogate loss to address this non-differentiable loss [15].

Conclusions
DCLEAR is a powerful resource for single cell lineage reconstruction.

Simulating trees with barcodes
For a simple illustration of lineage reconstruction, we simulated data using the sim_ seqdata function in the DCLEAR package coded using the R [16] language. The sim_n parameter represented the number of cell samples. The prob_state parameter represented the cell state probability. The parameter m represented the number of targets. The parameter n_s represented the number of outcome states which equals the length quantified by prob_state. The parameter mu_d represented the mutation probability for each target position on every cell division. The parameter d represented the number of cell divisions. Finally, the parameter p_d represented the dropout probability of each target position for every cell division. The code is outlined below: With this code, we generate a lineage tree of 20 leaf barcodes with 10 target positions. The sD$seqs contains the sequence information, and the sD$tree has the true tree structure: We can also print character information of the simulated barcodes.
The initial cell state is '0000000000' . There are d = 12 number of cell divisions resulting in 2 12 leaf nodes. Every target position changes to a different outcome state for each cell division with a probability mu_d = 0.03 . The mutational positions randomly change to different outcomes, which follows a multinomial distribution with a probability p = out_prob . Every target position of leaf nodes randomly changes to a missing state ('-') with a probability p d = 0.005 . Finally, simn = 20 cells are randomly sampled. The true lineage tree structure of 20 cells ( simn = 20 ) is recorded in sD$tree. The tree structure corresponding to simn = 20 cells is illustrated in Fig. 6.
Subsequently, we prepared five lineage trees as a training dataset. Each lineage tree has 10 leaves with 40 barcode target positions: SDs is a list of 5 lineage trees. Each item of the SDs list is generated using sim_seqdata function: The ten barcodes of the first training data are shown below.
We observed interval missing (dropout) events by marking them with a sequence of hyphens. Weight parameters for the WHD method were trained using the WH_train function. We specified the weight for the initial state and the weight for the dropout state.
We printed out the state characters as indicated below: Since the initial state ('0') is in the first position and the dropout state ('-') is in the second position, we specify loc0=1 and locDroputout = 2: We simulated one evaluation datum and compared the ground truth tree with three generated trees using the Hamming, WHD, and KRD distances with the FastMe algorithm for tree construction: First, we calculated the distance matrix using the Hamming distance, the WHD, and the KRD methods: Second, we constructed the tree using the FastMe algorithm with the tree rearrangement using fastme.bal function. The ground truth tree and the three generated trees were demonstrated in Fig. 7. As an alternative to fastme.bal function, we could use different tree construction algorithms using nj, upgma, and fastme. ols functions.
As outlined in Fig. 7, it is challenging to compare and determine which generation is the optimal one. One useful metric for making a determination is the RF distance. We calculated and compared the RF distances for the various generations. The RF distances were calculated and compared as indicated below: The lower the value of the RF distance between the ground truth tree and the generated tree, the greater the similarity between the ground truth tree and the generated tree. Thus, for the given example data, the KRD method produced optimal results.